110 research outputs found

    Automated Morphological Segmentation and Evaluation

    In this paper we introduce (i) a new method for the morphological segmentation of part-of-speech-labelled German words and (ii) several measures related to the minimum description length (MDL) principle for evaluating morphological segmentations. The segmentation algorithm is capable of discovering hierarchical structure and of retrieving new morphemes. It achieved 75% recall and 99% precision. Regarding MDL-based evaluation, a linear combination of vocabulary size and the size of the reduced deterministic finite-state automata that match exactly the segmentation output turned out to be an appropriate measure for ranking segmentation models according to their quality.

    Multi-Tier Annotations in the Verbmobil Corpus

    In very large and diverse scientific projects, where groups as different as linguists and engineers work with different intentions on the same signal data or its orthographic transcript and annotate valuable new information, it is not easy to build a homogeneous corpus. We describe how this can be achieved, given that some of these annotations have not been updated properly or are based on erroneous or deliberately changed versions of the basis transcription. We used an algorithm similar to dynamic programming to detect differences between the transcription on which an annotation depends and the reference transcription for the whole corpus. These differences are automatically mapped onto a set of repair operations for the transcriptions, such as splitting compound words and merging neighbouring words. The annotation is then corrected on the basis of these operations. Whether a correction can be carried out automatically or has to be made manually depends on the type of annotation and on the position and nature of the difference. Finally, we present an investigation in which we exploit the multi-tier annotations of the Verbmobil corpus to find out how breathing is correlated with prosodic-syntactic boundaries and dialog acts.
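The difference-detection step described in this abstract can be illustrated with a minimal sketch. It does not reproduce the authors' algorithm: it swaps in Python's standard `difflib.SequenceMatcher` for the alignment, and the function name `repair_operations`, the operation labels, and the example words are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def repair_operations(old_words, ref_words):
    """List the operations that turn old_words into ref_words.

    Each entry is (operation, old_span, ref_span). A replacement whose two
    sides contain the same characters with different word boundaries is
    relabelled as "split" or "merge" (hypothetical labels).
    """
    ops = []
    matcher = SequenceMatcher(a=old_words, b=ref_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_span, ref_span = old_words[i1:i2], ref_words[j1:j2]
        if tag == "replace" and "".join(old_span) == "".join(ref_span):
            if len(old_span) == 1 and len(ref_span) > 1:
                tag = "split"   # one word became several
            elif len(old_span) > 1 and len(ref_span) == 1:
                tag = "merge"   # several words became one
        ops.append((tag, old_span, ref_span))
    return ops


# Transcription an annotation was based on vs. the reference transcription.
old = ["wir", "treffen", "uns", "am", "Montagmorgen"]
ref = ["wir", "treffen", "uns", "am", "Montag", "morgen"]
print(repair_operations(old, ref))
```

Operations labelled "split" or "merge" are the kind that could plausibly be propagated to a dependent annotation automatically, while plain replacements, insertions and deletions would more likely need manual inspection.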

    Investigation of Language Structure by Means of Language Models Incorporating Breathing and Articulatory Noise

    In our experiment we used a bigram language model and a standard speech recogniser to test whether linguistic information is related to the position of silence, articulatory noise, background noise, laughing and breathing in spontaneous speech. We observed that for silence and articulatory noise the acoustic modelling is more important than the linguistic information represented in the bigrams of a language model. Breathing carries useful information that can be described in a language model, because including it in the language model improves test set perplexity and recognition accuracy. This means that precisely defined noise items add some linguistic knowledge to the language model and contribute to better performance of an automatic speech recogniser.
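To make "test set perplexity" concrete, here is a minimal sketch of a bigram language model in which a noise event is treated as an ordinary token. The add-one smoothing, the token name `<breath>`, and the function names are assumptions made for illustration; the abstract does not say which smoothing or toolkit was used.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories over lists of tokens."""
    histories, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for sent in sentences for w in sent} | {"</s>"}
    return histories, bigrams, vocab


def perplexity(model, sentences):
    """Perplexity under add-one smoothing (an assumed smoothing scheme)."""
    histories, bigrams, vocab = model
    v = len(vocab) + 1  # one extra class shared by all unseen words
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for h, w in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(h, w)] + 1) / (histories[h] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


# Hypothetical transcripts in which breathing is kept as a token.
train = [["<breath>", "ja", "gut"], ["<breath>", "dann", "am", "Montag"]]
test = [["<breath>", "ja", "am", "Montag"]]
print(perplexity(train_bigram(train), test))
```

Comparing this value with the perplexity obtained after deleting the noise tokens from both sets, counted over the same word tokens, gives the kind of comparison the abstract reports.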

    Characterising a Database of Spoken German by Techniques of Data Mining

    When designing and characterising large speech corpora one frequently faces the question of how the growth of vocabulary is related to the number of words uttered or written. In the past there have been several attempts to find a formula that describes this relation efficiently. In this article we try to derive such a formula for a spontaneous speech corpus. The narrow scenario specifications of the examined dialogues instruct two speakers to negotiate a certain task which is almost the same for all the recording sessions (see Table 1) and therefore has a limited vocabulary. This distinguishes our study from the classical work in this field, which was done on written text such as journal or newspaper articles with a broader range of vocabulary.
    Introduction
    In the forties of this century scientists started to work on a formula that describes how the growth of vocabulary is related to the number of written words in a text. J. W. Chotlos (1944) worked on essays written by American pupils, ..
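The quantity being modelled, vocabulary size as a function of the number of running words, can be measured with a few lines of code. This is a minimal sketch only: the function name `vocabulary_growth` and the sample tokens are assumptions, and no particular growth formula from the article is implied.

```python
def vocabulary_growth(tokens, step=1):
    """Return (running word count, distinct word count) pairs.

    A pair is recorded every `step` tokens, giving the empirical curve
    to which a candidate growth formula could be fitted.
    """
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve


# Hypothetical dialogue fragment, already tokenised.
tokens = "ja gut dann ja Montag gut".split()
print(vocabulary_growth(tokens))
```

For a corpus with a narrow task scenario, one would expect this curve to flatten earlier than for newspaper text, which is the contrast the abstract draws.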

    Durational Aspects in Turn Taking

    On the basis of multichannel dialogue recordings made with headset microphones, we show that pause and overlap durations are Gaussian distributed when projected onto a logarithmic scale. This implicitly challenges Sacks, Schegloff and Jefferson's principle: "Minimize gap and overlap". We compare the turn-taking patterns of German, American English and Japanese and calculate time constants that might be relevant for the design of automatic dialogue management systems.
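A distribution that is Gaussian on a logarithmic scale is summarised by the mean and standard deviation of the log durations. The sketch below shows that computation with the standard library; the function name `log_duration_stats`, the base-10 logarithm, and the sample values are assumptions, not figures from the study.

```python
import math
import statistics


def log_duration_stats(durations_s):
    """Summarise positive durations (in seconds) on a log10 scale.

    Returns the mean and sample standard deviation of the log durations,
    plus the mean mapped back to seconds (the geometric mean), which can
    serve as a characteristic time constant.
    """
    logs = [math.log10(d) for d in durations_s]  # requires d > 0
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)
    return mu, sigma, 10 ** mu


# Hypothetical pause durations in seconds.
pauses = [0.1, 1.0, 10.0]
print(log_duration_stats(pauses))
```

Pauses and overlaps would be summarised separately, each as a positive duration, so that the two distributions can be compared across languages.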